Goto

Collaborating Authors

 citation quality



Deep Literature Survey Automation with an Iterative Workflow

Zhang, Hongbo, Cui, Han, Wang, Yidong, Tian, Yijian, Guo, Qi, Wang, Cunxiang, Wu, Jian, Song, Chiyu, Zhang, Yue

arXiv.org Artificial Intelligence

Automatic literature survey generation has attracted increasing attention, yet most existing systems follow a one-shot paradigm, where a large set of papers is retrieved at once and a static outline is generated before drafting. This design often leads to noisy retrieval, fragmented structures, and context overload, ultimately limiting survey quality. Inspired by the iterative reading process of human researchers, we propose \ours, a framework based on recurrent outline generation, in which a planning agent incrementally retrieves, reads, and updates the outline to ensure both exploration and coherence. To provide faithful paper-level grounding, we design paper cards that distill each paper into its contributions, methods, and findings, and introduce a review-and-refine loop with visualization enhancement to improve textual flow and integrate multimodal elements such as figures and tables. Experiments on both established and emerging topics show that \ours\ substantially outperforms state-of-the-art baselines in content coverage, structural coherence, and citation quality, while producing more accessible and better-organized surveys. To provide a more reliable assessment of such improvements, we further introduce Survey-Arena, a pairwise benchmark that complements absolute scoring and more clearly positions machine-generated surveys relative to human-written ones. The code is available at https://github.com/HancCui/IterSurvey\_Autosurveyv2.



SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models

Bao, Tong, Nayeem, Mir Tafseer, Rafiei, Davood, Zhang, Chengzhi

arXiv.org Artificial Intelligence

Automatic survey generation has emerged as a key task in scientific document processing. While large language models (LLMs) have shown promise in generating survey texts, the lack of standardized evaluation datasets critically hampers rigorous assessment of their performance against human-written surveys. In this work, we present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains, along with 242,143 cited references and extensive quality-related metadata for both the surveys and the cited papers. Leveraging this resource, we build QUAL-SG, a novel quality-aware framework for survey generation that enhances the standard Retrieval-Augmented Generation (RAG) pipeline by incorporating quality-aware indicators into literature retrieval to assess and select higher-quality source papers. Using this dataset and framework, we systematically evaluate state-of-the-art LLMs under varying levels of human involvement - from fully automatic generation to human-guided writing. Experimental results and human evaluations show that while semi-automatic pipelines can achieve partially competitive outcomes, fully automatic survey generation still suffers from low citation quality and limited critical analysis.


Lessons from Training Grounded LLMs with Verifiable Rewards

Sim, Shang Hong, Pala, Tej Deep, Toh, Vernon, Chieu, Hai Leong, Zadeh, Amir, Li, Chuan, Majumder, Navonil, Poria, Soujanya

arXiv.org Artificial Intelligence

Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.


MedCite: Can Language Models Generate Verifiable Text for Medicine?

Wang, Xiao, Tan, Mengjue, Jin, Qiao, Xiong, Guangzhi, Hu, Yu, Zhang, Aidong, Lu, Zhiyong, Zhang, Minjia

arXiv.org Artificial Intelligence

Existing LLM-based medical question-answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce \name, the first end-to-end framework that facilitates the design and evaluation of citation generation with LLMs for medical tasks. Meanwhile, we introduce a novel multi-pass retrieval-citation method that generates high-quality citations. Our evaluation highlights the challenges and opportunities of citation generation for medical tasks, while identifying important design choices that have a significant impact on the final citation quality. Our proposed method achieves superior citation precision and recall improvements compared to strong baseline methods, and we show that evaluation results correlate well with annotation results from professional experts.


CiteEval: Principle-Driven Citation Evaluation for Source Attribution

Xu, Yumo, Qi, Peng, Chen, Jifan, Liu, Kunlun, Han, Rujun, Liu, Lan, Min, Bonan, Castelli, Vittorio, Gupta, Arshit, Wang, Zhiguo

arXiv.org Artificial Intelligence

Citation quality is crucial in information-seeking systems, directly influencing trust and the effectiveness of information access. Current evaluation frameworks, both human and automatic, mainly rely on Natural Language Inference (NLI) to assess binary or ternary supportiveness from cited sources, which we argue is a suboptimal proxy for citation evaluation. In this work we introduce CiteEval, a citation evaluation framework driven by principles focusing on fine-grained citation assessment within a broad context, encompassing not only the cited sources but the full retrieval context, user query, and generated text. Guided by the proposed framework, we construct CiteBench, a multi-domain benchmark with high-quality human annotations on citation quality. To enable efficient evaluation, we further develop CiteEval-Auto, a suite of model-based metrics that exhibit strong correlation with human judgments. Experiments across diverse systems demonstrate CiteEval-Auto's superior ability to capture the multifaceted nature of citations compared to existing metrics, offering a principled and scalable approach to evaluate and improve model-generated citations.


PA-RAG: RAG Alignment via Multi-Perspective Preference Optimization

Wu, Jiayi, Cai, Hengyi, Yan, Lingyong, Sun, Hao, Li, Xiang, Wang, Shuaiqiang, Yin, Dawei, Gao, Ming

arXiv.org Artificial Intelligence

The emergence of Retrieval-augmented generation (RAG) has alleviated the issues of outdated and hallucinatory content in the generation of large language models (LLMs), yet it still reveals numerous limitations. When a general-purpose LLM serves as the RAG generator, it often suffers from inadequate response informativeness, response robustness, and citation quality. Past approaches to tackle these limitations, either by incorporating additional steps beyond generating responses or optimizing the generator through supervised fine-tuning (SFT), still failed to align with the RAG requirement thoroughly. Consequently, optimizing the RAG generator from multiple preference perspectives while maintaining its end-to-end LLM form remains a challenge. To bridge this gap, we propose Multiple Perspective Preference Alignment for Retrieval-Augmented Generation (PA-RAG), a method for optimizing the generator of RAG systems to align with RAG requirements comprehensively. Specifically, we construct high-quality instruction fine-tuning data and multi-perspective preference data by sampling varied quality responses from the generator across different prompt documents quality scenarios. Subsequently, we optimize the generator using SFT and Direct Preference Optimization (DPO). Extensive experiments conducted on four question-answer datasets across three LLMs demonstrate that PA-RAG can significantly enhance the performance of RAG generators. Our code and datasets are available at https://github.com/wujwyi/PA-RAG.


Advancing Large Language Model Attribution through Self-Improving

Huang, Lei, Feng, Xiaocheng, Ma, Weitao, Zhao, Liang, Fan, Yuchun, Zhong, Weihong, Xu, Dongliang, Yang, Qing, Liu, Hongtao, Qin, Bing

arXiv.org Artificial Intelligence

Teaching large language models (LLMs) to generate text with citations to evidence sources can mitigate hallucinations and enhance verifiability in information-seeking systems. However, improving this capability requires high-quality attribution data, which is costly and labor-intensive. Inspired by recent advances in self-improvement that enhance LLMs without manual annotation, we present START, a Self-Taught AttRibuTion framework for iteratively improving the attribution capability of LLMs. First, to prevent models from stagnating due to initially insufficient supervision signals, START leverages the model to self-construct synthetic training data for warming up. To further self-improve the model's attribution ability, START iteratively utilizes fine-grained preference supervision signals constructed from its sampled responses to encourage robust, comprehensive, and attributable generation. Experiments on three open-domain question-answering datasets, covering long-form QA and multi-step reasoning, demonstrate significant performance gains of 25.13% on average without relying on human annotations and more advanced models. Further analysis reveals that START excels in aggregating information across multiple sources.


On the Capacity of Citation Generation by Large Language Models

Qian, Haosheng, Fan, Yixing, Zhang, Ruqing, Guo, Jiafeng

arXiv.org Artificial Intelligence

Retrieval-augmented generation (RAG) appears as a promising method to alleviate the "hallucination" problem in large language models (LLMs), since it can incorporate external traceable resources for response generation. The essence of RAG in combating the hallucination issue lies in accurately attributing claims in responses to the corresponding retrieved documents. However, most of existing works focus on improving the quality of generated responses from the LLM, while largely overlooked its ability to attribute sources accurately. In this study, we conduct a systematic analysis about the capabilities of LLMs in generating citations within response generation, and further introduce a novel method to enhance their citation generation abilities. Specifically, we evaluate both the correctness and citation quality for seven widely-used LLMs on two benchmark datasets. Meanwhile, we introduce new citation evaluation metrics to eliminate the over-penalization of unnecessary and excessive citations in existing metrics. Furthermore, we propose a Generate-then-Refine method that completes relevant citations and removes irrelevant ones without altering the response text. The results on WebGLM-QA, ASQA and ELI5 datasets show that our method substantially improves the quality of citations in responses generated by LLMs.